Unsupervised record matching with noisy and incomplete data

Authors

  • Yves van Gennip (University of Nottingham)
  • Blake Hunter (Claremont McKenna College)
  • Anna Ma (Claremont Graduate University)
  • Daniel Moyer (University of Southern California)
  • Ryan de Vera (formerly California State University, Long Beach)
  • Andrea L. Bertozzi (University of California, Los Angeles)

Abstract

We consider the problem of duplicate detection: given a large data set in which each entry has multiple attributes, detect which distinct entries refer to the same real-world entity. Our method consists of three main steps: creating a similarity score between entries, grouping entries together into ‘unique entities’, and refining the groups. We compare various methods for creating similarity scores, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that in certain parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method; they also confirm that our method for automatically determining the number of groups works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
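As a rough illustration of the first step (creating a similarity score between entries), the sketch below computes the standard TF-IDF cosine similarity between short text records. It is not the authors' vectorized soft TF-IDF method: the whitespace tokenizer, the particular TF and IDF weightings, the function names `tfidf_vectors` and `cosine`, and the sample records are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(records):
    """Return one L2-normalized TF-IDF dict per record (whitespace tokens)."""
    tokenized = [r.lower().split() for r in records]
    n = len(tokenized)
    # Document frequency: number of records that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Log-scaled TF times a smoothed IDF (an assumed weighting; many variants exist).
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical records: the first two are meant to describe the same entity.
records = [
    "john smith 123 main street",
    "jon smith 123 main st",
    "maria garcia 9 ocean avenue",
]
vecs = tfidf_vectors(records)
score_dup = cosine(vecs[0], vecs[1])   # shares "smith", "123", "main"
score_diff = cosine(vecs[0], vecs[2])  # shares no tokens
```

In this sketch "john" and "jon" contribute nothing to the score because only exact token matches count. Soft TF-IDF variants additionally give credit to tokens that are close under a string similarity measure, which is the kind of gap the abstract's comparison of methods is about.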

Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Feature Curve Co-Completion in Noisy Data

Feature curves on 3D shapes provide important hints about significant parts of the geometry and reveal their underlying structure. However, when we process real world data, automatically detected feature curves are affected by measurement uncertainty, missing data, and sampling resolution, leading to noisy, fragmented, and incomplete feature curve networks. These artifacts make further processi...

Damage detection of skeletal structures using particle swarm optimizer with passive congregation (PSOPC) algorithm via incomplete modal data

This paper uses a PSOPC model-based non-destructive damage identification procedure that relies on frequency and modal data. The objective function for the minimization problem is based on the frequency changes. The method is demonstrated on a cantilever beam, a four-bay plane truss, and a two-bay two-story plane frame under different scenarios. In this study, the modal data are provided nume...

A Nonlinear Grayscale Morphological and Unsupervised method for Human Facial Synthesis Based on an Example Image

Generating a human face from an example image is a requirement in biometric applications that identify individuals. In this paper, face generation consists of three main steps. In the first step, significant lines and edges of the example image are detected using nonlinear grayscale morphology. Then, hair areas are identified on the sample face. The fin...

Unsupervised Blocking of Imbalanced Datasets for Record Matching

Record matching in data engineering refers to searching for data records originating from the same entities across different data sources. Solutions for record matching usually employ learning algorithms to train a classifier that labels record pairs as either matches or non-matches. In practice, the number of non-matches typically far exceeds the number of matches. This is the so-called imb...


Journal:
  • CoRR

Volume: abs/1704.02955

Publication date: 2017